IBM System x GPFS Storage Server - stfc
IBM System x GPFS Storage Server
Crispin Keable
Technical Computing Architect
© 2012 <strong>IBM</strong> Corporation
IBM Technical Computing: a comprehensive portfolio that uniquely addresses supercomputing and mainstream client needs

• Power Systems™ – engine for faster insights
• PureSystems™ – integrated expertise for improved economics
• System x® – redefining x86
• Blue Gene® – extremely fast, energy-efficient supercomputer
• System Storage® – smarter storage
• iDataPlex® – fast, dense, flexible
• GPFS and GPFS Storage Server – big data storage
• IBM Platform LSF® Family
• IBM Platform Symphony Family
• IBM Platform HPC
• IBM Platform Cluster Manager
• HPC Cloud solutions
• Technical Computing for Big Data
• Intelligent Cluster – factory-integrated, interoperability-tested system with compute, storage, networking and cluster management
“Perfect Storm” of Synergetic Innovations
GPFS Native RAID Storage Server: big data converging with HPC technology – server and storage convergence

• Disruptive integrated storage software: declustered RAID with GPFS reduces overhead and speeds rebuilds by ~4-6x
• Performance: POWER and x86 cores are more powerful than special-use controller chips
• High-speed interconnect: clustering and storage traffic, including failover (PERCS/Power fabric, InfiniBand, or 10GE)
• Data integrity, reliability and flexibility: end-to-end checksum, 2- and 3-fault tolerance, application-optimized RAID
• Integrated hardware/packaging: server and storage co-packaging improves density and efficiency
• Cost/performance: a software-based controller reduces hardware overhead and cost, and enables enhanced functionality
High End: POWER
IBM GPFS Native RAID p775: high-density storage + compute server
• Based on the Power 775 / PERCS solution
• Basic configuration:
  – 32 POWER7 32-core high-bandwidth servers
  – Configurable as GPFS Native RAID storage controllers, compute servers, I/O servers or spares
• Up to 5 disk enclosures per rack
  – 384 drives and 64 quad-lane SAS ports each
• Capacity: 1.1 PB/rack (900 GB SAS HDDs)
• Bandwidth: >150 GB/s read bandwidth per rack
• Compute power: 18 TF + node sparing
• Interconnect: IBM high-bandwidth optical PERCS
• Multi-rack scalable, fully water-cooled
1 rack performs a 1 TB Hadoop TeraSort in less than 3 minutes!
How does GNR work?
Traditional setup: clients connect to file/data servers (e.g., x3650 NSD File Servers 1 and 2), which sit in front of custom dedicated disk controllers and JBOD disk enclosures.
GNR setup: clients connect over FDR InfiniBand or 10 GbE to NSD file servers running GPFS Native RAID, attached directly to JBOD disk enclosures.
The idea: migrate RAID and disk management to commodity file servers!
A Scalable Building Block Approach to Storage
Building block: x3650 M4 servers with a “twin-tailed” JBOD disk enclosure.
Complete storage solution: data servers, disk (NL-SAS and SSD), software, InfiniBand and Ethernet.
• Model 24 – light and fast: 4 enclosures, 20U, 232 NL-SAS + 6 SSD, 10 GB/s
• Model 26 – HPC workhorse: 6 enclosures, 28U, 348 NL-SAS + 6 SSD, 12 GB/s
• High-density HPC option: 18 enclosures in 2 standard 42U racks, 1044 NL-SAS + 18 SSD, 36 GB/s
Performance figures are based on the IOR benchmark.
GPFS Native RAID Feature Detail
• Declustered RAID
  – Data and parity stripes are uniformly partitioned and distributed across a disk array.
  – Arbitrary number of disks per array (not constrained to an integral number of RAID stripe widths)
• 2-fault and 3-fault tolerance
  – Reed-Solomon parity encoding
  – 2- or 3-fault-tolerant stripes: 8 data strips + 2 or 3 parity strips
  – 3- or 4-way mirroring
• End-to-end checksum and dropped-write detection
  – From the disk surface to the GPFS user/client
  – Detects and corrects off-track and lost/dropped disk writes
• Asynchronous error diagnosis while affected I/Os continue
  – If media error: verify and restore if possible
  – If path problem: attempt alternate paths
• Supports live replacement of disks
  – I/O operations continue for tracks whose disks have been removed during carrier service
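A quick sketch of the arithmetic behind these stripe layouts (an illustration of the trade-off, not GNR code): a stripe of 8 data strips plus p parity strips tolerates p concurrent disk faults and stores 8/(8+p) of the raw capacity as user data.

```python
def stripe_efficiency(data_strips: int, parity_strips: int) -> float:
    """Fraction of raw capacity available for user data."""
    return data_strips / (data_strips + parity_strips)

for p in (2, 3):
    eff = stripe_efficiency(8, p)
    print(f"8+{p} Reed-Solomon: tolerates {p} faults, "
          f"{eff:.0%} storage efficiency")

# 3-way mirroring (also 2-fault-tolerant) for comparison:
print(f"3-way mirror: {1/3:.0%} efficiency")
```

The 80% (8+2) and 73% (8+3) figures match the efficiency numbers quoted later in this deck.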
Declustering – bringing parallel performance to disk maintenance
• Conventional RAID: narrow data+parity arrays
  – 20 disks arranged as 4 conventional RAID arrays of 5 disks each, holding 4x4 RAID stripes (data plus parity)
  – After a disk fails, rebuild can only use the I/O capacity of the 4 surviving disks in that array
  – Because files are striped across all arrays, all file accesses are throttled by the failed array’s rebuild overhead
• Declustered RAID: data+parity distributed over all disks
  – The same 20 disks form 1 declustered RAID array holding 16 RAID stripes (data plus parity)
  – After a disk fails, rebuild can use the I/O capacity of all 19 surviving disks
  – Load on file accesses is reduced by 4.8x (=19/4) during array rebuild
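The 4.8x figure falls straight out of the disk counts, as this back-of-envelope sketch shows: a conventional 5-disk array rebuilds from its 4 survivors, while a 20-disk declustered array rebuilds from all 19 survivors.

```python
def rebuild_speedup(total_disks: int, array_width: int) -> float:
    """Ratio of disks available to a declustered rebuild vs. a
    conventional rebuild confined to one array_width-disk array."""
    return (total_disks - 1) / (array_width - 1)

print(rebuild_speedup(20, 5))  # 19/4 = 4.75, the slide's ~4.8x
```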
Declustered RAID Example
• Conventional layout: 3 one-fault-tolerant mirrored groups (RAID1), 7 stripes per group (2 strips per stripe) – 3 groups on 6 disks, plus 1 spare disk.
• Declustered layout: the same 21 stripes (42 strips) plus 7 spare strips – 49 strips in total – are spread uniformly across all 7 disks.
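The strip accounting in this example can be sanity-checked in a few lines (just arithmetic on the numbers above):

```python
# 3 mirrored groups, 7 stripes each, 2 strips per stripe, on 7 disks.
groups, stripes_per_group, strips_per_stripe = 3, 7, 2
disks = 7

stripes = groups * stripes_per_group        # 21 stripes
data_strips = stripes * strips_per_stripe   # 42 strips
spare_strips = stripes_per_group            # 7 spare strips (one disk's worth)
total_strips = data_strips + spare_strips   # 49 strips

# Declustered, the 49 strips fill the 7 disks evenly, 7 strips per disk:
assert total_strips == disks * 7
print(stripes, data_strips, spare_strips, total_strips)  # 21 42 7 49
```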
Rebuild Overhead Reduction Example
• Conventional RAID: after a disk fails, rebuild read/write activity is confined to just a few disks – a slow rebuild that disrupts user programs.
• Declustered RAID: rebuild read/write activity is spread across many disks – less disruption to user programs.
Rebuild overhead is reduced by 3.5x in this example.
GPFS Native RAID Advantages
• Lower cost!
  – Software RAID – no hardware storage controller
    • 10-30% lower cost with higher performance
  – Off-the-shelf SBODs
    • Generic low-cost disk enclosures
    • Standardized in-band SES management
  – Standard Linux or AIX
  – Generic high-volume servers
  – Component of GPFS
• Extreme data integrity
  – 2- and 3-fault-tolerant erasure codes
    • 80% and 73% storage efficiency, respectively
  – End-to-end checksum
  – Protection against lost writes
• Industry-leading performance
  – Fastest rebuild times, using declustered RAID
  – Declustered RAID – reduced application load during rebuilds
    • Up to 3x lower overhead to applications
  – Aligned full-stripe writes – disk limited
  – Small writes – limited by backup-node NVRAM log writes
  – Faster than alternatives today – and tomorrow!
Introducing IBM System x GPFS Storage Server:
Bringing HPC Technology to the Mainstream
• Better, sustained performance
  – Industry-leading throughput using efficient declustered RAID techniques
• Better value
  – Leverages System x servers and commercial JBODs
• Better data security
  – From the disk platter to the client
  – Enhanced RAID protection technology
• Affordably scalable
  – Start small and affordably
  – Scale via incremental additions
  – Add capacity AND bandwidth
• 3-year warranty
  – Manage and budget costs
• IT-facility friendly
  – Industry-standard 42U 19-inch rack mounts
  – No special height requirements
  – Client racks are OK!
• And all the data management/life-cycle capabilities of GPFS – built in!
Declustered RAID6 Example
Left: 14 physical disks / 3 traditional RAID6 arrays / 2 spares. Right: the same 14 physical disks / 1 declustered RAID6 array / 2 spares – data, parity and spare space are all declustered.
With two failed disks:
• Traditional RAID6: both failures land in one array (Green), so every one of its stripes carries 2 faults.
  Faults per stripe (Red/Green/Blue), 7 stripes: 0/2/0 for every stripe.
  Number of stripes with 2 faults = 7
• Declustered RAID6: the two failures are spread across the whole array.
  Faults per stripe (Red/Green/Blue): 1/0/1, 0/0/1, 0/1/1, 2/0/0, 0/1/1, 1/0/1, 0/1/0.
  Number of stripes with 2 faults = 1
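A toy version of this comparison (illustrative only; the strip placements below are hypothetical, not GNR's actual layout): with two failed disks out of 14, count how many stripes are left carrying 2 faults under each layout.

```python
import random

random.seed(1)
DISKS, STRIPE_WIDTH, STRIPES = 14, 4, 21  # e.g., 2+2 RAID6-style stripes

# Traditional: stripes confined to fixed 4-disk arrays (disks 12, 13 spare).
traditional = [tuple(range(a, a + STRIPE_WIDTH))
               for a in (0, 4, 8) for _ in range(7)]

# Declustered: each stripe placed on a random set of 4 of the 14 disks.
declustered = [tuple(random.sample(range(DISKS), STRIPE_WIDTH))
               for _ in range(STRIPES)]

def stripes_with_two_faults(layout, failed):
    return sum(1 for stripe in layout
               if sum(d in failed for d in stripe) >= 2)

failed = {4, 5}  # two failures inside the second traditional array
print("traditional:", stripes_with_two_faults(traditional, failed))  # 7
print("declustered:", stripes_with_two_faults(declustered, failed))
```

In the traditional layout every stripe of the hit array has 2 faults; in the declustered layout only the occasional stripe happens to touch both failed disks.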
Where GPFS Storage Server Fits
(Figure: positioning chart, entry level to high end.)
• Storage offerings: Direct Attached (DS3000 + V3700), DCS3700, DCS3700+, SONAS, and GPFS Storage Server at the high end.
• Client segments: local universities, petroleum, media/entertainment, financial services, bio/life science, CAE, higher-end universities, government research.
Data Protection Designed for 200K+ Drives!
• Platter-to-client protection
  – Multi-level data protection to detect and prevent bad writes and on-disk data loss
  – Data checksum carried and sent from the platter to the client server
• Integrity management
  – Rebuild
    • Selectively rebuild portions of a disk
    • Restore full redundancy, in priority order, after disk failures
  – Rebalance
    • When a failed disk is replaced with a spare disk, redistribute the free space
  – Scrub
    • Verify checksums of data and parity/mirror
    • Verify consistency of data and parity/mirror
    • Fix problems found on disk
  – Opportunistic scheduling
    • At full disk speed when there is no user activity
    • At a configurable rate when the system is busy
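The scrub step can be sketched for the simplest case, a 2-way mirror (hypothetical structures; GNR's real scrubber also handles parity codes and opportunistic scheduling): verify each copy's checksum and repair a corrupt copy from its healthy mirror.

```python
import zlib

def strip(data: bytes):
    """An on-disk strip: payload plus its checksum."""
    return {"data": data, "crc": zlib.crc32(data)}

def scrub_mirrored(stripes):
    """Return the number of strips repaired from their mirror."""
    repairs = 0
    for a, b in stripes:  # each stripe is a pair of mirrored strips
        for bad, good in ((a, b), (b, a)):
            good_ok = zlib.crc32(good["data"]) == good["crc"]
            if zlib.crc32(bad["data"]) != bad["crc"] and good_ok:
                bad["data"], bad["crc"] = good["data"], good["crc"]
                repairs += 1
    return repairs

# One stripe whose second copy has silently rotted on disk:
mirror = (strip(b"payload"), strip(b"payload"))
mirror[1]["data"] = b"pAyload"          # silent corruption
print(scrub_mirrored([mirror]))         # 1 repair
assert mirror[1]["data"] == b"payload"  # restored from the healthy copy
```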
Non-Intrusive Disk Diagnostics
• Disk hospital: background determination of problems
  – While a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error correction code.
  – For writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete.
• Advanced fault determination
  – Statistical reliability and SMART monitoring
  – Neighbor check
  – Media error detection and correction
GSS – End-to-End Checksums and Version Numbers
• End-to-end checksums: each data block carries a checksum trailer
  – Write operation
    • Checked between the user compute node and the GNR node
    • Written from the GNR node to disk together with a version number
  – Read operation
    • Checked from disk to the GNR node, along with the version number
    • Checked from the I/O node to the user compute node
• Version numbers in metadata are used to validate checksum trailers for dropped-write detection
  – Only a validated checksum can protect against dropped writes
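A toy illustration of why the version number matters (hypothetical on-disk format, not GNR's): a dropped write leaves a stale block whose checksum trailer is still internally valid, so the checksum alone cannot catch it. Comparing the trailer's version against the version recorded in metadata does.

```python
import zlib

def make_block(data: bytes, version: int):
    """A block with a checksum trailer and a write version number."""
    return {"data": data, "crc": zlib.crc32(data), "version": version}

def read_check(block, expected_version: int) -> str:
    if zlib.crc32(block["data"]) != block["crc"]:
        return "corrupt"            # bit rot, off-track write
    if block["version"] != expected_version:
        return "dropped write"      # stale but self-consistent block
    return "ok"

disk = make_block(b"old contents", version=1)
# A rewrite to version 2 is dropped by the disk; metadata says version 2:
print(read_check(disk, expected_version=2))  # dropped write
```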
GSS Data Integrity
• Silent data corruption
  – Caused by disk off-track writes, dropped writes (e.g., disk firmware bugs), or undetected read errors
• Old adage: “No data is better than bad data”
• Proper data integrity checking requires an end-to-end checksum plus dropped-write detection.
(Figure: a client reads block A and the disk returns A; later it reads A again and the disk returns B – silently.)
GNR / Mestor Future Research Directions
• GNR “ring” configuration
  – Adaptation of the building-block approach
  – Shared (dual-ported) disks; data managed by storage nodes
  – Overlapping configuration: half the nodes of standard controllers
    • Storage node pairs attach to shared disks
    • Scale out to many storage nodes over the fabrics
    • Global namespace / disk management
• Mestor
  – Non-shared-disk approach / network RAID
    • Data striped across storage nodes
    • Each storage node has captive disks
    • Scale out to many storage nodes
    • Global namespace / disk management
IBM Confidential
GPFS Native RAID for System x Proposed Timeline
2012 – first customer ship at ISC12; v1.0 announce at SC12.
V1.0: Getting started
• System x Intelligent Clusters “solution”, ordered through the System x Intelligent Cluster process
• Software installed and configured at the customer location by the end user or IBM Services
• Support coordinated by Intelligent Clusters
• Early-access customers
2013 – V1.5 announce at ISC13.
V1.5
• Solution sold via Intelligent Clusters
• Bug fixes
• Support provided via the Intelligent Clusters standard mechanism
• Upgrade path defined for current DCS3700 customers
• Drive roll for new drives (4 TB NL-SAS)
2014 – V2.0: complete machine type/model, fully supported
• Plug-and-play GPFS appliance
• GUI for management
• Evaluate smaller form factor – 12-drive enclosures?
2015 – V2.5: miniaturization release
• Support entry level based upon Mestor
• Storage-rich servers (internal drives)
• RAID across the servers